
Merge feature/perf to master #6229


Merged: 75 commits merged into master on Jan 31, 2025

Conversation

edwintorok
Contributor

Relevant feature flags are off by default or match previous values:

coordinator_max_stunnel_cache
member_max_stunnel_cache
stunnel_cache_max_age
event_from_delay
event_from_task_delay
event_next_delay
use-event-next
use-xmlrpc
tgroups-enabled
timeslice

Draft PR to check conflicts, waiting for testing to complete.

edwintorok and others added 30 commits October 25, 2024 14:41
No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…queue

Wake up the scheduler immediately when there are more tasks.
Otherwise timeouts <10s may not work correctly, and it is difficult to test the
periodic scheduler if you need to wait 10s for it to start working.

If there are no tasks, then it will still sleep efficiently, but as soon as
more tasks are added (with [~signal:true], which is the default) it will
immediately wake up and calculate the next sleep time.

In practice it is probably quite rare for XAPI's queue to be empty (there are usually periodic tasks),
but we cannot rely on this.
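
For illustration, a minimal sketch of this wake-up behaviour, assuming the `Delay` module from xapi-stdext-threads (module path assumed; the list-based queue and helper names are illustrative, not the real Ipq-based scheduler):

```ocaml
module Delay = Xapi_stdext_threads.Threadext.Delay

let m = Mutex.create ()
let delay = Delay.make ()
let queue : (float * (unit -> unit)) list ref = ref [] (* (deadline, job) *)

let add_to_queue ?(signal = true) ~after fn =
  Mutex.lock m ;
  queue := (Unix.gettimeofday () +. after, fn) :: !queue ;
  Mutex.unlock m ;
  (* wake the scheduler immediately instead of waiting out its sleep *)
  if signal then Delay.signal delay

let rec scheduler_loop () =
  let now = Unix.gettimeofday () in
  Mutex.lock m ;
  let due, pending = List.partition (fun (t, _) -> t <= now) !queue in
  queue := pending ;
  (* sleep until the earliest remaining deadline, at most 10s *)
  let sleep = List.fold_left (fun a (t, _) -> min a (t -. now)) 10. pending in
  Mutex.unlock m ;
  List.iter (fun (_, job) -> job ()) due ;
  ignore (Delay.wait delay (Float.max 0.01 sleep)) ; (* returns early when signalled *)
  scheduler_loop ()
```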

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…internal communication

Feature flag: use-xmlrpc

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…owercase

Tasks are lowercase

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Subscription.object_matches already does it

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Creates a new library, `Tgroup`, that abstracts and manages groups
of execution threads in xapi.

When xapi is under load, all of its threads have to share a single CPU in
dom0 because of the OCaml runtime's single-CPU restriction. This library
is meant to orchestrate the threads into different priority groups.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
`set_cgroup` adds the functionality of placing the current thread in a
cgroup based on its creator.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Add functionality for setting the tgroup based on an HTTP header named
`originator`.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Clears the `extra_headers` of the `UDSTransport` instance when making a
connection.

Previously, this was done only when tracing was enabled, inside the
`with_tracecontext` method, to avoid header duplication when
`make_connection` was used multiple times. Currently, there is no
other use of `add_extra_headers` and no other update to
`_extra_headers`, making it safe to clear it when we make a new
connection.

(`xmlrpclib.Transport` updates the `_extra_headers` attribute only
inside the `make_connection` method, but we override this method with our
own for `UDSTransport`.)

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
XenAPI.py now passes an additional originator header when making
requests to xapi, if the "ORIGINATOR" env var is present.
Sm_exec now passes an env var, "ORIGINATOR=SM".

To classify the threads correctly, we first need to determine the
requests' originators. This commit makes it possible to explicitly
state the originators as headers.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
For now, the thread executing `Xapi.server_init` and its children are
classified as External. The only exception is HTTP requests that come
through smapi internally. If those contain the originator header
with the value set to "sm", the thread executing the request will be
classified as internal.

This represents the first phase of classifying xapi threads as internal
vs external.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds a configurable variable in `xapi_globs`, `tgroups_enabled`, that is
meant to act as a guard for the tgroup classification of threads. If the
guard is `false`, all Tgroups functionality should act as a no-op.

For instance, adding the line:

tgroups-enabled = false

will result in the thread classification being skipped.
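
For illustration, a minimal sketch of how such a guard can be wired (the wrapper and function names are illustrative; only `tgroups_enabled` comes from the change itself):

```ocaml
(* Set from the tgroups-enabled entry in xapi.conf. *)
let tgroups_enabled = ref false

(* Every classification entry point goes through this wrapper, so the
   whole Tgroup feature collapses to a no-op when the flag is off. *)
let if_tgroups_enabled f = if !tgroups_enabled then f ()

let classify_current_thread ~creator =
  if_tgroups_enabled (fun () ->
      (* ... perform the actual tgroup classification for [creator] ... *)
      ignore creator
  )
```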

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
…n flag

This is more efficient: we can watch a single task, instead of everything in
the DB.

Feature-flag: use-event-next

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
int32_of_rpc doesn't accept Int32 as input, just Int, because none of the current deserializers actually produce an Int32
(Int32 is only used by serializers to emit something different).

This is an upstream ocaml-rpc bug that should be fixed; meanwhile, convert Rpc.Int32 to Rpc.Int so that the 'fake_rpc' inside XAPI
can use Event.from.

Otherwise you get this error:
```
Expected int32, got 'I32(0)
```
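
For illustration, a minimal sketch of the workaround, assuming ocaml-rpc's `Rpc.t` constructors (where `Int` carries an `int64`); the traversal below is illustrative, not necessarily the exact code in this commit:

```ocaml
let rec fixup_int32 = function
  | Rpc.Int32 i ->
      (* deserializers only accept Int, so widen Int32 before decoding *)
      Rpc.Int (Int64.of_int32 i)
  | Rpc.Enum xs ->
      Rpc.Enum (List.map fixup_int32 xs)
  | Rpc.Dict kvs ->
      Rpc.Dict (List.map (fun (k, v) -> (k, fixup_int32 v)) kvs)
  | rpc ->
      rpc
```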

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…vent.next

This is more efficient: we can watch a single task, instead of everything in
the DB.

Feature-flag: use-event-next

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…cution threads. (#6076)

This is the follow up to
#6020

The initial phase is to classify the threads between Internal and
External.
- External is the default (for now),
- Internal are threads that process internal requests coming from smapi.

BVT + BST: 207007 (Dev Run)
… of Event.next

Feature flag: use-event-next

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
)

When XAPI's queue is empty the periodic scheduler would sleep for 10s,
during which time newly added jobs would get ignored.

This is mostly noticeable during unit tests, but we cannot rely on the
queue not being empty in production either.
It is better to wake up the scheduler on the earlier of these two events:
* after 10s
* when a new task gets added to the queue
Previously it'd only run when we added or removed entries, but on an
idle system we'd keep a large number of connections open that we don't
need, and this then exceeded the connection limits on the coordinator.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Introduce separate coordinator_max_stunnel_cache and member_max_stunnel_cache settings,
and set these on XAPI startup based on host role.

No functional change, the defaults for both match the previous hardcoded
value (70).
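
For illustration, the corresponding xapi.conf entries (both defaults match the previous hardcoded value):

coordinator_max_stunnel_cache = 70
member_max_stunnel_cache = 70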

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
To avoid a labeled argument with ' in its name

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This parameter was passed through unchanged.
However VDI.epoch_begin/epoch_end doesn't actually need it, so drop it.

We intend to change how we compute the 'domain' parameter, and VDI.epoch_begin/end wouldn't have
sufficient information to compute it anymore.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We are going to change how we compute it, so factor it out into a function.

No functional change.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
edwintorok and others added 21 commits December 12, 2024 13:51
No feature flag, because this is a deprecated API.
Clients that want the best performance should already be using Event.from.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
This delay was right after we waited for a new event, delaying all event
responses by 50ms (including task completions).
Eliminate the first delay, so that if we find the event we're looking for after the
DB update, we can return immediately.

On spurious wakeups (e.g. not the event we subscribed for) the delay is still
useful, so keep it for recursive calls after the first one, and exponentially
increase it up to the configured maximum.
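
For illustration, a minimal sketch of this backoff shape (the function name and the 50ms base are illustrative stand-ins for the configured values):

```ocaml
let wait_for_event ~check ~max_delay =
  let rec loop delay =
    match check () with
    | Some ev ->
        ev (* found right after the DB update: return immediately *)
    | None ->
        (* spurious wakeup: sleep, then retry with an exponentially larger delay *)
        Thread.delay delay ;
        loop (min max_delay (delay *. 2.))
  in
  (* no delay before the first check; 50ms before the second *)
  loop 0.05
```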

No feature flag, this is a relatively small change, and we use exponential backoffs
elsewhere in XAPI already.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Give an opportunity for more fields to be filled, e.g. when waiting for a task
to complete, give a chance for the task to actually run.

No feature flag, it only changes timing.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
We generate O(N^2) events when we update O(N) fields: each field update
generates an event including the entire object, even if later we are
going to change other fields of the same object.

Instead of returning the individual field update events immediately (and
generating a storm of events whenever an API client watches for VM power
events), we batch these event updates by introducing a minimum amount of
time that successive Event.from calls need to have between them.
(The client is working as expected here: when it gets an event and
processes it, it immediately calls Event.from to get more events.)

Although this isn't guaranteed to eliminate the O(N^2) problem, in
practice it reduces the overhead significantly.
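
For illustration, a minimal sketch of the rate limit described above (illustrative, not Xapi_event's actual code): enforce a minimum interval between the previous reply and the next one, so a burst of per-field events on the same object is delivered as one batch.

```ocaml
let last_reply = ref 0.

let rate_limited_event_from ~min_interval fetch_events =
  let now = Unix.gettimeofday () in
  let to_wait = !last_reply +. min_interval -. now in
  if to_wait > 0. then Thread.delay to_wait ;
  let events = fetch_events () in
  last_reply := Unix.gettimeofday () ;
  events
```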

There is one case where we do want almost immediate notification of
updates: task completions (because then the client likely wants to send
us more tasks).

This PR makes the already existing rate limiting in Xapi_event
consistent and configurable, but doesn't yet introduce a batching delay
for Event.from (it does for Event.next, which is deprecated). A separate
PR (or config change) can then enable this for testing purposes, but
also allows us to roll the change back by changing the tunable in the
config file.

There is also a new microbenchmark introduced here; I'll need to update
it with the latest results.
Uses Gc.Memprof to run a hook function periodically.
This checks whether we are holding any locks, and if not, and sufficient time has elapsed since the last yield,
then we yield.

POSIX timers wouldn't work here, because they deliver signals, and there are too many places in XAPI
that don't handle EINTR properly.
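
For illustration, a minimal sketch of the mechanism, assuming OCaml 4.x's `Gc.Memprof` interface (the sampling rate and 5ms threshold are illustrative; the real hook also consults the lock state described elsewhere in this PR):

```ocaml
let last_yield = ref (Unix.gettimeofday ())

let maybe_yield () =
  let now = Unix.gettimeofday () in
  if now -. !last_yield > 0.005 then begin
    last_yield := now ;
    Thread.yield () (* no-op if there is no other runnable OCaml thread *)
  end

let start_periodic_yield () =
  (* piggyback on allocation sampling: the callback fires "periodically"
     at GC-safe points, without delivering any signal *)
  let tracker =
    { Gc.Memprof.null_tracker with
      Gc.Memprof.alloc_minor= (fun _ -> maybe_yield () ; None)
    }
  in
  Gc.Memprof.start ~sampling_rate:1e-4 tracker
```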

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
And apply on startup.

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
Adds more granularity to the xapi thread classification. Now, an
individual thread can be mapped to one of the following:

- {xapi_cgroup}/internal/SM;
- {xapi_cgroup}/internal/cli;
- {xapi_cgroup}/external/intrapool;
- {xapi_cgroup}/external/unauthenticated;
- {xapi_cgroup}/external/authenticated/{identity};

where {identity} is a {auth_user_sid}/{user_agent}.
Both the {auth_user_sid} and {user_agent} strings are sanitized when
making the identity through `Identity.make`, which allows only
alphanumeric characters and a maximum length of 16 characters for each.

This is only the library change.

This should allow for proper thread classification in xapi.
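
For illustration, a minimal sketch of the sanitization rule (illustrative, not the actual `Identity.make`): keep only alphanumeric characters and cap each component at 16 characters.

```ocaml
let sanitize s =
  let alnum c =
    (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9')
  in
  let s = String.to_seq s |> Seq.filter alnum |> String.of_seq in
  if String.length s > 16 then String.sub s 0 16 else s

(* {identity} = {auth_user_sid}/{user_agent}, both sanitized *)
let make ~auth_user_sid ~user_agent =
  Printf.sprintf "%s/%s" (sanitize auth_user_sid) (sanitize user_agent)
```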

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds unit tests for:
- the correct thread classification of `of_creator`;
- the sanitization performed by `Identity.make`.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Classifies the threads at the time of session creation and inside
`do_dispatch`.

This ensures that new threads created by the current session/request inherit
the proper classification.

Note: threads created by xenopsd calling back into xapi are yet to be
classified.

Signed-off-by: Gabriel Buica <danutgabriel.buica@cloud.com>
Adds more granularity to the xapi thread classification. Now, an
individual thread can be mapped to one of the following:

- {xapi_cgroup}/internal/SM;
- {xapi_cgroup}/internal/cli;
- {xapi_cgroup}/external/intrapool;
- {xapi_cgroup}/external/unauthenticated;
- {xapi_cgroup}/external/authenticated/{identity};

where {identity} is a {auth_user_sid}/{user_agent}.
Both {auth_user_sid} and {user_agent} strings are sanitized when
making the identity through `Identity.make`, by allowing only
alphanumeric characters and a maximum length of 16 characters each.

This should allow for proper thread classification in xapi. 

Note: Threads calling back into xapi from xenopsd are yet to be
classified.

e.g. Cgroup hierarchy based on the new classification 

![image](https://github.com/user-attachments/assets/29774a5c-0581-405f-965c-4c7c823c6492)

BVT: 208947 & BST: 208972
Fixed a conflict in scheduler.ml (code that was deleted in feature/perf
was modified in master).
Continuing from #6208

The conflict resolution can be reviewed with:
`git log --remerge-diff -1 8b8af63`

```diff
diff --git a/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml b/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml
remerge CONFLICT (content): Merge conflict in ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml
index 7bace30dae..27cf306 100644
--- a/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml
+++ b/ocaml/libs/xapi-stdext/lib/xapi-stdext-threads/scheduler.ml
@@ -33,49 +33,8 @@ let (queue : t Ipq.t) = Ipq.create 50 queue_default

 let lock = Mutex.create ()

-<<<<<<< 6589d9a (Xapi thread classification - part 2 (#6154))
 let add_to_queue_span name ty start_span newfunc =
   let ( ++ ) = Mtime.Span.add in
-||||||| 4f3f08f
-module Clock = struct
-  let span s = Mtime.Span.of_uint64_ns (Int64.of_float (s *. 1e9))
-
-  let span_to_s span =
-    Mtime.Span.to_uint64_ns span |> Int64.to_float |> fun ns -> ns /. 1e9
-
-  let add_span clock secs =
-    (* return mix or max available value if the add overflows *)
-    match Mtime.add_span clock (span secs) with
-    | Some t ->
-        t
-    | None when secs > 0. ->
-        Mtime.max_stamp
-    | None ->
-        Mtime.min_stamp
-end
-
-let add_to_queue name ty start newfunc =
-  let ( ++ ) = Clock.add_span in
-=======
-module Clock = struct
-  let span s = Mtime.Span.of_uint64_ns (Int64.of_float (s *. 1e9))
-
-  let span_to_s span = Mtime.Span.to_float_ns span |> fun ns -> ns /. 1e9
-
-  let add_span clock secs =
-    (* return mix or max available value if the add overflows *)
-    match Mtime.add_span clock (span secs) with
-    | Some t ->
-        t
-    | None when secs > 0. ->
-        Mtime.max_stamp
-    | None ->
-        Mtime.min_stamp
-end
-
-let add_to_queue name ty start newfunc =
-  let ( ++ ) = Clock.add_span in
->>>>>>> 8c9b754 (SDK fixes including CP-53003 (#6210))
   let item =
     {
       Ipq.ev= {func= newfunc; ty; name}
```

This code was deleted in `feature/perf` by
6b02474, where `span_to_s` was replaced
with `Clock.Timer.span_to_s`, and changed in master by
e68cda7.
Both achieve the same result (`span_to_s` uses `Mtime.Span.to_float_ns`),
except that the commit on feature/perf also removes code duplication by
reusing the other clock module.
# Changing the default OCaml thread switch timeslice from 50ms

The default OCaml 4.x timeslice for switching between threads is 50ms:
if there is more than one active OCaml thread, each one is allowed to run for up
to 50ms, and then (at various safepoints) the runtime can switch to another
running thread.
When the runtime lock is released (and C code or syscalls run), another
OCaml thread is immediately allowed to run, if there is one.

However, 50ms is too long, and it inserts large latencies into the
handling of API calls.

On the other hand, if a timeslice is too short then we waste CPU time on:
* the overhead of the Thread.yield system call, and the cost of switching
threads at the OS level
* potentially higher L1/L2 cache misses if we switch between multiple OCaml
threads on the same CPU
* potentially losing branch predictor history
* potentially higher L3 cache misses (but on a hypervisor with VMs
running, L3 will be mostly taken up by VMs anyway; we can only rely on
L1/L2 staying with us)

A microbenchmark has shown that timeslices as small as 0.5ms might
strike an optimal balance between latency and overhead: values lower
than that lose performance due to increased overhead, and values higher
than that lose performance due to increased latency:


![auto_p](https://github.com/user-attachments/assets/3751291b-8f64-4d70-9a65-9c3fdb053955)

![auto_pr](https://github.com/user-attachments/assets/3b710484-87ba-488a-9507-7916c85aab20)

(the microbenchmark measures the number of CPU cycles spent simulating
an API call with various working set sizes and timeslice settings)

This is all hardware dependent though, and a future PR will introduce an
autotune service that measures the yield overhead and L1/L2 cache refill
overhead and calculates an optimal timeslice for that particular
hardware/Xen/kernel combination.
(and while we're at it, we can also tweak the minor heap size to match
~half of CPU L2 cache).

# Timeslice change mechanism

Initially I used `Unix.set_itimer` with virtual timers, to switch a
thread only when it has been actively using the CPU for too long. However,
that relies on delivering a signal to the process, and XAPI is very bad
at handling signals.
In fact, XAPI is not allowed to receive any signals, because it doesn't
handle EINTR well (a typical problem that sometimes affects C programs
too). Although this is a well-understood problem (described in the
[OCaml Unix
book](https://ocaml.github.io/ocamlunix/ocamlunix.html#sec88)) and some
areas of XAPI make an effort to handle it, others just assert that they
never receive one. Fixing that would require changes throughout XAPI (and
its dependencies).

So instead I don't use signals at all, but rely on Statmemprof to
trigger a hook that is executed "periodically", based not purely on
time but on allocation activity (i.e. at places where the GC could run). The
hook checks the elapsed time since the last time it was called, and if
too much has passed it calls Thread.yield.
Yield is smart enough to be a no-op if there aren't any other runnable
OCaml threads.

Yield isn't always beneficial at reducing latencies, though: if we
are holding locks, then we're just increasing latency for everyone who
waits for that lock.
So a mechanism is introduced to notify the periodic function when any
highly contended locks are held (e.g. the XAPI DB lock), and the yield is
skipped in that case.
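
For illustration, a minimal sketch of that mechanism (all names are illustrative): wrappers around the highly contended locks keep a counter, and the periodic hook only yields when the counter is zero.

```ocaml
let contended_locks_held = Atomic.make 0

let with_contended_lock m f =
  Mutex.lock m ;
  Atomic.incr contended_locks_held ;
  Fun.protect
    ~finally:(fun () -> Atomic.decr contended_locks_held ; Mutex.unlock m)
    f

let periodic_hook () =
  (* don't yield while e.g. the DB lock is held; that would only add
     latency for every thread waiting on it *)
  if Atomic.get contended_locks_held = 0 then Thread.yield ()
```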

# Plotting code

This PR only includes a very simplified version of the microbenchmark; a
separate one will introduce the full cache-plotting code (which is
useful for development/troubleshooting purposes but won't be needed at
runtime).

# Default timeslice value

Set to 5ms for now, just a bit above 4ms = 1/HZ in our Dom0 kernel. The
autotuner from a future PR can change this to a more appropriate value
(the autotuner needs more validation on a wider range of hardware).
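
For illustration, the corresponding xapi.conf entry (the value format is an assumption here, taking the setting in seconds):

timeslice = 0.005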

# Results

The cache measurements need to be repeated on a wider variety of
hardware, but the timeslice changes here have already proven useful in
reducing XAPI DB lock hold times (together with other optimizations).
…l used internally

By default Event.next is still used internally, so although this API is deprecated do not yet enable the throttling by default.
Fixes: 3e1d8a2 ("CP-51692: Event.next: use same batching as Event.from")
Fixes: 2b4e0db ("CP-49158: [prep] Event.{from,next}: make delays configurable and prepare for task specific delays")

It slows down all synchronous API calls that create tasks, like VM.start.

Only enable the throttling when Event.next is not used internally (`use-event-next = false` in xapi.conf),
which will eventually become the default.

The code prior to the above changes used 0 delay between checking for events, so do the same here (although this leads to a lot of inefficient wakeups of all active tasks in XAPI whenever anything changes, it matches previous behaviour).

Signed-off-by: Edwin Török <edwin.torok@cloud.com>
…l used internally (#6222)

This fixes a performance regression on `feature/perf` with all feature
flags disabled.

By default Event.next is still used internally, so although this API is
deprecated do not yet enable the throttling by default. Fixes:
3e1d8a2 ("CP-51692: Event.next: use same batching as Event.from")
Fixes: 2b4e0db ("CP-49158: [prep] Event.{from,next}: make delays
configurable and prepare for task specific delays")

It slows down all synchronous API calls that create tasks, like
VM.start.

Only enable the throttling when Event.next is not used internally
(`use-event-next = false` in xapi.conf), which will eventually become
the default.

The code prior to the above changes used 0 delay between checking for
events, so do the same here (although this leads to a lot of inefficient
wakeups of all active tasks in XAPI whenever anything changes, it
matches previous behaviour).

(the code for Event.from used a 50ms delay, which matches the default
for that setting already)
Signed-off-by: Edwin Török <edwin.torok@cloud.com>
The conflict resolution can be reviewed by `git log --remerge-diff -1`:
```diff
diff --git a/ocaml/xapi-storage-script/main.ml b/ocaml/xapi-storage-script/main.ml
remerge CONFLICT (content): Merge conflict in ocaml/xapi-storage-script/main.ml
index 2dea78f513..ae725b9 100644
--- a/ocaml/xapi-storage-script/main.ml
+++ b/ocaml/xapi-storage-script/main.ml
@@ -1502,17 +1502,11 @@ let bind ~volume_script_dir =
     |> wrap
   in
   S.VDI.attach3 vdi_attach3_impl ;
-<<<<<<< 6ccaf7b (Update feature/perf with latest blocker fixes (#6237))
-  let vdi_activate_common dbg dp sr vdi' vm readonly =
-||||||| 1e5114c
-  let vdi_activate_common dbg dp sr vdi' vm' readonly =
-=======
   let dp_attach_info_impl dbg sr vdi dp vm =
     vdi_attach3_impl dbg dp sr vdi vm ()
   in
   S.DP.attach_info dp_attach_info_impl ;
-  let vdi_activate_common dbg dp sr vdi' vm' readonly =
->>>>>>> 6949dbd (Typo. Only throw assertions at Debug time. (#6262))
+  let vdi_activate_common dbg dp sr vdi' vm readonly =
     (let vdi = Storage_interface.Vdi.string_of vdi' in
      let domain = domain_of ~dp ~vm in
      Attached_SRs.find sr >>>= fun sr ->
@@ -1817,7 +1811,7 @@ let bind ~volume_script_dir =
         stat ~dbg ~sr ~vdi:temporary
     )
     >>>= fun response ->
-    choose_datapath domain response >>>= fun (rpc, datapath, uri, domain) ->
+    choose_datapath response >>>= fun (rpc, datapath, uri) ->
     if Datapath_plugins.supports_feature datapath _vdi_mirror_in then
       return_data_rpc (fun () ->
           Datapath_client.import_activate (rpc ~dbg) dbg uri domain
diff --git a/ocaml/xapi/xapi_globs.ml b/ocaml/xapi/xapi_globs.ml
remerge CONFLICT (content): Merge conflict in ocaml/xapi/xapi_globs.ml
index 3c96b1792a..c07c3d9 100644
--- a/ocaml/xapi/xapi_globs.ml
+++ b/ocaml/xapi/xapi_globs.ml
@@ -1692,7 +1692,6 @@ let other_options =
     , (fun () -> string_of_bool !disable_webserver)
     , "Disable the host webserver"
     )
-<<<<<<< 6ccaf7b (Update feature/perf with latest blocker fixes (#6237))
   ; ( "tgroups-enabled"
     , Arg.Set tgroups_enabled
     , (fun () -> string_of_bool !tgroups_enabled)
@@ -1701,14 +1700,11 @@ let other_options =
   ; event_from_entry
   ; event_from_task_entry
   ; event_next_entry
-||||||| 1e5114c
-=======
   ; ( "drivertool"
     , Arg.Set_string driver_tool
     , (fun () -> !driver_tool)
     , "Path to drivertool for selecting host driver variants"
     )
->>>>>>> 6949dbd (Typo. Only throw assertions at Debug time. (#6262))
   ]

 (* The options can be set with the variable xapiflags in /etc/sysconfig/xapi.
```
@edwintorok
Contributor Author

Testing found no bugs unique to this feature branch (the bugs found are pre-existing bugs that already affect master).
This is what we expected: most features are turned off in this branch, so it is almost a no-op change, apart from some small bugfixes.

@edwintorok edwintorok marked this pull request as ready for review January 31, 2025 11:02
@edwintorok edwintorok enabled auto-merge January 31, 2025 11:02
@edwintorok edwintorok added this pull request to the merge queue Jan 31, 2025
Merged via the queue into master with commit d603928 Jan 31, 2025
27 checks passed